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Abstract 

We consider the following multi-component sparse PCA problem: given a set of data points, 
we seek to extract a small number of sparse components with disjoint supports that jointly 
capture the maximum possible variance. These components can be computed one by one, 
repeatedly solving the single-component problem and deflating the input data matrix, but as 
we show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA 
that jointly optimizes multiple disjoint components. The extracted features capture variance 
that lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithm 
is combinatorial and computes the desired components by solving multiple instances of the 
bipartite maximum weight matching problem. Its complexity grows as a low order polynomial 
in the ambient dimension of the input data matrix, but exponentially in its rank. However, 
it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtain 
polynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm on 
real data-sets and empirically demonstrate that in many cases it outperforms existing, deflation- 
based approaches. 


1 Introduction 


Principal Component Analysis (PCA) reduces the dimensionality of a data set by projecting it onto 
principal subspaces spanned by the leading eigenvectors of the sanmle covariance matrix. Sparse 
PCA is a useful variant that offers higher data interpretability a property that is sometimes 

desired even at the cost of statistical hdelity [^. Furthermore, when the obtained features are used 
in subsequent learning tasks, sparsity potentially leads to better generalization error j^. 

Given a real n x d data matrix S representing n centered data points supported on d features, 
the leading sparse principal component of the data set is the sparse vector that maximizes the 
explained variance: 


A T A 

X* = argmax x Ax, 

||x||2 = l,||x||o=S 


( 1 ) 


where A = i/n • S''^S is the d x d empirical covariance matrix. The sparsity constraint makes 
the problem NP-hard and hence computationally intractable in general, and hard to approximate 
within some small constant [^. A significant volume o^rior work has focused on algorithms that 
approximately solve the optimization problem 0,i,3,0BS [I3,EI,[il,E3J 0, flbl. w hile a large 
volume of theoretical results has been established under planted statistical models jl6l . 0, 0, 0, 

0 , 0 , 0 , 0 , 01 - 
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In most practical settings, we tend to go beyond computing a single sparse PC. Contrary to the 
single-component problem, there has been limited work on computing multiple components. The 
scarcity is partially attributed to conventional PCA wisdom: multiple components can be computed 
one-by-one, repeatedly, by solving the single-component sparse PCA problem ([T]) and deflating the 
input data to remove information captured by previously extracted components [2^. In fact, 
the multi-component version of sparse PCA is not uniquely defined in the literature. Different 
deflation-based approaches can lead to different outputs: extracted components may or may not be 
orthogonal, while they may have disjoint or overlapping supports 25|]. In the statistics literature, 
where the objective is typically to recover a “true” principal subspace, a branch of work has focused 
on the “subspace row sparsity” , an assumption that leads to sparse components all supported 
on the same set of variables. While in 27|, the authors discuss an alternative perspective on the 
fundamental objective of the sparse PCA problem. 

In this work, we develop a novel algorithm for the multi-component sparse PCA problem with 
disjoint supports. Formally, we are interested in finding k components that are s-sparse, have 
disjoint supports, and jointly maximize the explained variance: 


X* = argmaxTR(X^AX), 

XGA'fc 


( 2 ) 


where the feasible set is 


Afc ^ {X G : llX^lls = 1, ||X^||o = s, supp(X*) n supp(XJ') = 0, Vj G [k]fl < j}, 

with X-^ denoting the jth column of X. The number k of the desired components is a user defined 
parameter and we consider it to be a small constant. 

Contrary to the greedy sequential approach that repeatedly uses deflation, our algorithm jointly 
computes all the vectors in X, and comes with theoretical approximation guarantees. We note that 
even if one could solve each single-component sparse PCA problem ([1]) exactly, greedy deflation 
can be highly suboptimal. We show this through a simple example in Section 0 


Our Contributions 

1. We develop an algorithm that provably approximates the solution to the sparse PCA problem ([2]) 
within a multiplicative factor arbitrarily close to 1. To the best of our knowledge, this is the 
first algorithm that jointly optimizes multiple components with disjoint supports, provably. Our 
algorithm is combinatorial; it recasts sparse PCA as multiple instances of bipartite maximum 
weight matching on graphs determined by the input data. 

2. The computational complexity of our algorithm grows as a low order polynomial in the ambient 
dimension d, but is exponential in the intrinsic dimension of the input data, i.e., the rank 
of A. To alleviate the impact of this dependence, our algorithm can be applied on a low¬ 
dimensional sketch of the input data to obtain an approximate solution to ([2]) . This extra level 
of approximation introduces an additional penalty in our theoretical approximation guarantees, 
which naturally depends on the quality of the sketch and, in turn, the spectral decay of A. We 
show how these bounds further translate to an additive PTAS (polynomial-time approximation 
scheme) for sparse PCA. Our additive PTAS outputs an approximate solution with explained 
variance of at least OPT — e ■ s, for any sparsity s G {1, ... ,n}, any constant error e > 0 and 
any k = 0(1) number of orthogonal components0 

^Here, OPT is the explained variance captured by the optimal set of k components that are s sparse and have 
disjoint supports. 
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3. We empirically evaluate our algorithm on real datasets, and compare it against state-of-the-art 
methods for the single-component sparse PC A problem m in conjunction with the appro¬ 
priate deflation step. In many cases, our algorithm—as a result of jointly optimizing over 
multiple components—leads to significantly improved results, and outperforms deflation-based 
approaches. 


2 Sparse PCA through Bipartite Matchings 

Our algorithm approximately solves the constrained maximization ([2]) on a d x d rank-r PSD 
matrix A within a multiplicative factor arbitrarily close to 1. It operates by recasting the maxi¬ 
mization into multiple instances of the bipartite maximum weight matching problem. Each instance 
ultimately yields a feasible solution; a set of k components that are s-sparse and have disjoint sup¬ 
ports. The algorithm examines these solutions, and outputs the one that maximizes the explained 
variance, i.e., the quadratic objective in (j^. 

The computational complexity of our algorithm grows as a low order polynomial in the ambient 
dimension d of the input, but exponentially in its rank r. Despite the unfavorable dependence on 
the rank, it is unlikely that a substantial improvement can be achieved in general [^. However, de¬ 
coupling the dependence on the ambient and the intrinsic dimension of the input has an interesting 
ramification; instead of the original input A, our algorithm can be applied on a low-rank surrogate 
to obtain an approximate solution, alleviating the dependence on r. We discuss this in Section [3l 
and present the approximation bound that this allows us to obtain. 

Let A = UAU^ denote the truncated eigenvalue decomposition of A; A is a diagonal r x r 
whose ith diagonal entry is equal to the zth largest eigenvalue of A, while the columns of U are 
the corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x G M'^, 

x'''Ax = IIA^/^U'''x||2 > (A^/^U"''x, c)^, V c G M'’ : ||c||2 = 1. (3) 

In fact, equality in ([3]) can always be achieved for c colinear to A^/^Ux G and in turn 

x^Ax = max (x, UA^/2 j,\2 

where denotes the ^ 2 -unit sphere in r dimensions. More generally, for any X G 

k k 

Tr^X^AX^ = ^^X'^'''aX'^ = max ^ UA^/^C-^ )'• (4) 

j=l C;CJG§2 j—i 

Under the variational characterization of the trace objective in Q, the sparse PCA problem ([2]) 
can be re-written as a joint maximization over the variables X and C as follows: 

k 

max Tr(X’^AX) = max max V/X^', UAi/2ci\2_ 

The alternative formulation of the sparse PCA problem in ([5]) takes a step towards decoupling the 
dependence of the optimization on the ambient and intrinsic dimensions d and r, respectively. The 
motivation behind the introduction of the auxiliary variable C will become clear in the sequel. 

For a given C, the value of X G Afc that maximizes the objective in ([5]) for that C is 

k 

X = arg max ^ (X^ W-^' >', (6) 

xeR’fc 
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where W=UA^/^C is a real d x k matrix. The constrained, non-convex maximization ([6]) plays a 
central role in our developments. We will later describe a combinatorial 0{d ■ {s ■ /c)^) procedure to 
efficiently compute X, reducing the maximization to an instance of the bipartite maximum weight 
matching problem. For now, however, let us assume that such a procedure exists. 

Let X*, C* be the pair that attains the maximum in ([5]); in other words, X* is the desired 
solution to the sparse PCA problem. If the optimal auxiliary variable C* was known, then we 
would be able to recover X* by solving the maximization Q for C = C*. Of course, C* is not 
known, and it is not possible to exhaustively consider all possible values in the domain of C. 
Instead, we examine only a finite number of possible values of C over a fine discretization of its 
domain. In particular, let A/'e/ 2 (§ 2 ~^) denote a finite <^/2-net of the r-dimensional £ 2 -unit sphere; for 
any point in the net contains a point within an ^/2 radius from the former. There are several 
ways to construct such a net [ 2 ^ . Further, let [A/'e/ 2 (S 2 ~^)]®^ C denote the A:th Cartesian 

power of the aforementioned '^/ 2 -net. By construction, this collection of points contains a matrix C 
that is column-wise close to C*. In turn, it can be shown using the properties of the net, that the 
candidate solution X G obtained through ([6]) at that point C will be approximately as good as 
the optimal X* in terms of the quadratic objective in (j2]) . 

All above observations yield a 
procedure for approximately solving 


steps are outlined in Algorithm [TJ 
Given the desired number of com¬ 
ponents k and an accuracy parame¬ 
ter e G (0,1), the algorithm gener¬ 
ates a net [A/’€/ 2 (S 5 ~^)]'^^ iterates 
over its points. At each point C, it 
computes a feasible solution for the 
sparse PCA problem - a set of k s- 
sparse components - by solving the 
maximization in Q via a procedure 
(Alg. [ 2 ]) that will be described in the 
sequel. The algorithm collects the candidate solutions identified at the points of the net. The best 
among them achieves an objective in ([2]) that provably lies close to optimal. More formally. 

Theorem 1. For any real dx d rank-r PSD matrix A, desired number of components k, number s 
of nonzero entries per component, and accuracy parameter e G (0,1), Algorithmic outputs X G Tfc 
such that 

Tr(X^AX) > (l-e)-TR(xjAX*), 

where X*= argmaxxg;^^, Tr(X^AX) , in time Tsvd{t) + ■d-{s-k)‘^). 

Algorithm [1] is the first nontrivial algorithm that provably approximates the solution of the 
sparse PCA problem ([2]). According to Theorem [U it achieves an objective value that lies within 
a multiplicative factor from the optimal, arbitrarily close to 1. Its complexity grows as a low-order 
polynomial in the dimension d of the input, but exponentially in the intrinsic dimension r. Note, 
however, that it can be exponentially faster compared to the 0{d^'^) brute force approach that 
exhaustively considers all candidate supports for the k sparse components. The complexity of our 
algorithm follows from the cardinality of the net and the complexity of Algorithm [2l the subroutine 
that solves the constrained maximization ([6]) . The latter is a key ingredient of our algorithm, and is 
discussed in detail in the next subsection. A formal proof of Theorem [T] is provided in Section [9.21 


the sparse PCA problem ([2]). The 


Algorithm 1 Sparse PCA (Multiple disjoint components) 

input : PSD d x d rank-r matrix A, e G (0, 1), /c G Z_|_. 
output : X G {Theorem [1]} 

1: C ^ {} 

2: [U,A]^EIG(A) 

3: for each C G [A/1/2(S2~^)]'^^ do 
4: W ^ UAV2c {W G 

5: X ^ arg maxxgA-, (X^' , (Alg. [2]} 

6: C^CujX} 

7: end for 

8: X •(— argmaxxgc Tr(X^AX) 
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2.1 Sparse Components via Bipartite Matchings 

In the core of Algorithm [1] lies Algorithm [21 a procedure that solves the constrained maximization 
in ([6]). The algorithm breaks down the maximization into two stages. First, it identifies the support 
of the optimal solution X. Determining the support reduces to an instance of the maximum 
matching problem on a weighted bipartite graph G. Then, it recovers the exact values of the 
nonzero entries in X based on the Cauchy-Schwarz inequality. In the sequel, we provide a brief 
description of Algorithm [21 leading up to its guarantees in Lemma l2.II 

Let Zj=supp(X-^) be the support of the jth column of X, j = I,... ,k. The objective in ([6]) 
becomes 


A ■ < E E (7) 

j=l j=l i£lj j=l i£lj 

The last inequality is an application of the Cauchy-Schwarz Inequality and the constraint ||X-^ lb = 1 
Vj G {1,..., k}. In fact, if an oracle reveals the supports Ij, j = 1, ..., k, the upper bound in ([7]) 
can always be achieved by setting the nonzero entries of X as in Algorithm [2] (Line 6). Therefore, 
the key in solving ([^ is determining the collection of supports to maximize the right-hand side 
of dZl). 

By constraint, the sets Zj must be pairwise dis¬ 
joint, each with cardinality s. Consider a weighted 
bipartite graph G = (U = {Ui ,..., Uk}, V, E) con¬ 
structed as follow^ (Fig. H]): 

• C is a set of d vertices vi,... ,Vd, corresponding to 
the d variables, ie., the d rows of X. 

• [/ is a set of A: • s vertices, conceptually partitioned 
into k disjoint subsets Ui,... ,Uk, each of cardinal¬ 
ity s. The jth subset, Uj, is associated with the 
support Zj] the s vertices ttab a = 1, ..., s in Uj 
serve as placeholders for the variables/indices in Zj. 

• Finally, the edge set is E = U x V. The edge 
weights are determined by the dxk matrix W in ([6]) . 

In particular, the weight of edge {ua\vi) is equal 
to W^j. Note that all vertices in Uj are effectively 
identical; they all share a common neighborhood 
and edge weights. 

Any feasible support corresponds to a perfect matching in G and vice-versa. Recall that 

a matching is a subset of the edges containing no two edges incident to the same vertex, while a 
perfect matching, in the case of an unbalanced bipartite graph G = {U,V,E) with |17| < |R|, is a 
matching that contains at least one incident edge for each vertex in U. Given a perfect matching 
M. C E, the disjoint neighborhoods of UjS under Ai yield a support Conversely, any 

valid support yields a unique perfect matching in G (taking into account that all vertices in Uj 
are isomorphic). Moreover, due to the choice of weights in G, the right-hand side of ([7|) for a 
given support is equal to the weight of the matching A4 in G induced by the former, i.e., 

^The construction is formally outlined in Algorithm [4] in Section [8l 



Figure 1: The graph G generated by 
Alg. [21 It is used to determine the support 
of the solution X in ([6|). 
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Algorithm 2 Compute Candidate Solution 

input Real d x k matrix W 


output X = argmaxxgA-fe 


1 

G{{Uj]]^^,V,E) ^ GenBiGraph(W) 

{Aig. m} 

2 

M ■«— MaxWeightMatch(G) 

{CS} 

3 

X ■<— Odxk 


4 

for j = 1,... ,k do 


5 

i — \i G 1,..., (i} ! (u, 'U'j) G TW, G U 
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end for 



vj^j^w(u,v). It follows that determining the support of the solution in ([6]), 
reduces to solving the maximum weight matching problem on the bipartite graph G. 

Algorithm [2] readily follows. Given W E the algorithm generates a weighted bipartite 

graph G as described, and computes its maximum weight matching. Based on the latter, it first 
recovers the desired support of X (Line 5), and subsequently the exact values of its nonzero entries 
(Line 6). The running time is dominated by the computation of the matching, which can be done 
in ©(iLillC/l + |C/p log |C/|) using a variant of the Hungarian algorithm [^. Hence, 

Lemma 2.1. For any W E Algorithmic computes the solution to ([6]), in time 0[d ■ (s ■ . 

A more formal analysis and proof of Lemma 12.11 is available in Section 19.11 With Algorithm [2] 
and Lemma l2.II in place, we complete the description of our sparse PCA algorithm (Algorithm [1]) 
and the proof sketch of Theorem [TJ 


3 Sparse PCA on Low-Dimensional Sketches 


Algorithm [T] approximately solves the 
sparse PCA problem ([2|) on a d x d rank-r 
PSD matrix A, in time that grows as a 
low-order polynomial in the ambient dimen¬ 
sion d, but depends exponentially on r. This 
dependence can be prohibitive in practice. 

To mitigate its effect, instead of the origi¬ 
nal input, we can apply our sparse PCA al¬ 
gorithm on a low-rank approximation of A. 

Intuitively, the quality of the extracted components should depend on how well that low-rank 
surrogate approximates the original input. 

More formally, let S be the real n x d data matrix representing n (potentially centered) dat- 
apoints in d variables, and A the corresponding d x d covariance matrix. Further, let S be a 
low-dimensional sketch of the original data; an n x d matrix whose rows lie in an r-dimensional 
subspace, with r being an accuracy parameter. Such a sketch can be obtained in several ways, 
including for example exact or approximate SVD, or online sketching methods [^. Finally, let 
A = i/n • S^S be the covariance matrix of the sketched data. Then, instead of A, we can approxi¬ 
mately solve the sparse PCA problem by applying Algorithm [1] on the low-rank surrogate A. The 
above are formally outlined in Algorithm [3l We note that the covariance matrix A does not need 
to be explicitly computed; Algorithm [1] can operate directly on the (sketched) input data matrix. 


Algorithm 3 Sparse PCA on Low Dim. Sketch 

input : Real n X d S, r E Z_|_, e E (0,1), k E Z_|_. 
output X(^) E Afc. {Thm. [2]} 

1: S ^ SKETCH(S,r) 

2 : A ^ S'^S 

3: X(^) •(-Algorithm [U (A, e,/c). 
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Theorem 2. For any n x d input data matrix S, with corresponding empirical covariance matrix 
A = i/n • S'''S, any desired number of components k, and accuracy parameters e G (0,1) and r, 
Algorithmic outputs X(r) G such that 

Tr(xTAXm) > (l-e)-TR(XjAX,)-2-fc-Ai,,(A-A), 


in time TsKETcnir) + Tswir) + ■ d ■ {s ■ kf'). Here, X*= argmaxxeA’fc Tr(X^AX), and 

Ai^s(A) denotes the sparse eigenvalue, i.e., the eigenvalue that corresponds to the principal s-sparse 
eigenvector of A. 


The error Ai^s(A — A) and in turn the tightness of the approximation guarantees hinges on the 
quality of the sketch A. Higher values of the parameter r (the rank of the sketch) can allow for 
a more accurate solution and tighter guarantees. That is the case, for example, when the sketch 
is obtained through exact SVD. In that sense. Theorem [2] establishes a natural trade-off between 
the running time of Algorithm [3] and the quality of the approximation guarantees. A formal proof 
of Theorem [2] is provided in Section 19.31 Observe that the error term itself is a sparse eigenvalue 
that is hard to approximate, however even loose bounds provide tight conditional approximation 
results, as we see next. 

Using the main matrix approximation result of 3l|, the next theorem establishes that Algo¬ 
rithm [3] can be turned into an additive PTAS. 


Theorem 3. Let A he a d x d positive semidefinite matrix with entries in [—1, 1], 'V be a d x d 
matrix such that A = VV'''. Further, let R 6e a random d x r matrix with entries drawn i.i.d. 
according to M{Q,l/r), and define 

A = VRR^V^. 

For any constant e G (0,1], let r = 0(e“^logd). Then, for any desired sparsity s, and number 
of components k = 0(1), AlgorithmUl with input argument A and accuracy parameter e, outputs 
X(,.) G Tfc such that 

Tr(xJ,AX(„)) > Tr(xJAX*)- e-s 
with probability at least 1 — l/poly{d), in time . 

Remark 3.1. Note that Ai(A — A) serves as another elementary upper bound on Ai^s(A — A). If 
A is a the rank-d SVD approximation of A, then—similar to ]3^} —we can obtain a multiplicative 
PTAS for sparse PCA, under the assumption of a decaying spectrum (e.g., under a power-law 
decay), and for s = Ll{n). 


4 Related Work 

We are not aware of any algorithm with provable guarantees for sparse PCA with disjoint supports. 
Multiple components can be extracted by repeatedly solving ([T]) using one of the aforementioned 
methods. To ensure disjoint supports, variables “selected” by a component are removed from 
the dataset. This greedy approach, however, can result in highly suboptimal objective value (See 
example in Sec. ED. 

A significant volume of work has focused on the single-component sparse PCA problem ([ID; 
we scratch the surface and refer the reader to citations therein. Representative examples range 
from early heuristics in Qj, to the LASSO based techniques in j^, the elastic net .^i-regression in 
[^, ii and £q regularized optimization methods such as GPower in [^, a greedy branch-and-bound 
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technique in j^, or semidefinite programming approaches BEQ. The authors of present 
an approach that uses ideas from au expectatiou-maximizatiou (EM) formulation of the problem. 
More recently, presents a simple and very efficient truncated version of the power iteration 
(TPower). Finally, la] introduces an exact solver for the low-rank case of the problem; this solver 
was then used on low-rank sketches in the work of (SpanSPCA), that provides conditional 
approximation guarantees under spectral assumptions on the input data. Several ideas in this work 
are inspired by the aforementioned low-rank solvers. In our experiments, we compare against EM, 
TPower, and SpanSPCA, which all are experimentally achieving state-of-the-art performance. 

Parallel to the algorithmic and optimization perspective, there is large line of statistical analysis 
for sparse PC A that focuses on guarautees pertaiuiug to planted models and the recovery of a “true” 
sparse componeut 0, Ii3,0 0,0,0,0,0,0|- 

There has beeu some work ou the explicit estimation of principal subspaces or multiple compo¬ 
nents under sparsity constraints. Non-deflation-based algorithms include extensions of the diagonal 
thresholding algorithm [0 ] and iterative thresholding approaches 17], while 34] and 


351 ] propose 


methods that rely on the “row sparsity for subspaces” assumption of 2^. These methods yield 


components supported on a common set of variables, and hence solve a problem different from ([2]). 
Magdon-Ismail and Boutsidis [271] discuss the multiple component Sparse PCA problem, propose 
an alternative objective function and for that problem obtain interesting theoretical guarantees. 
Finally, 3^ develops a framework for sparse matrix factorizaiton problems, based on a novel atomic 
norm. That framework captures sparse PCA - although uot explicitly the constraint of disjoint 
supports ~ but the resulting optimization problem, albeit convex, is NP-hard. 


5 Experiments 

We evaluate our algorithm on a series of real datasets, and comp^ it to deflation-based approaches 
for sparse PCA using TPower [I3|, EM 13], and SpanSPCA 


iM 


The latter are representative 
of the state of the art for the single-component sparse PCA problem ([1]). Multiple components 
are computed one by one. To ensure disjoint supports, the deflation step effectively amounts to 
removing from the dataset all variables used by previously extracted components. For algorithms 
that are randomly initialized, we depict best results over multiple random restarts. Additional 
experimental results are listed in Section [TT] of the appendix. 

Our experiments are conducted in a Matlab environment. Due to its nature, our algorithm 
is easily parallelizable; its prototypical implemeutation utilizes the Parallel Pool Matlab feature 
to exploit multicore (or distributed cluster) capabilities. Recall that our algorithm operates on a 
low-rank approximation of the input data. Unless otherwise specified, it is configured for a rank-4 
approximation obtained via truncated SVD. Finally, we put a time barrier in the execution of 
our algorithm, at the cost of the theoretical approximation guarantees; the algorithm returns best 
results at the time of termination. This “early termination” can only hurt the performance of our 
algorithm. 


Leukemia Dataset. We evaluate our algorithm on the Leukemia dataset 37j. The dataset 
comprises 72 samples, each consisting of expression values for 12582 probe sets. We extract k = 5 


sparse components, each active on s = 50 features. In Fig. 2(a) we plot the cumulative explained 
variance versus the number of components. Deflation-based approaches are greedy: the leading 
components capture high values of variance, but subsequeut ones contribute less. Ou the coutrary, 
our algorithm joiutly optimizes the k = 5 components and achieves higher total cumulative variance; 
one cannot identify a top component. We repeat the experiment for multiple values of k. Fig. |2(b)| 
depicts the total cumulative variance capture by each method, for each value of k. 






















xIO^ A: = 5 components, s = 50 nnz/component 



(a) 


s = 50 nnz/component 



23456789 10 


Number of target components 

(b) 


Figure 2: Cumulative variance captured by the k s-sparse extracted components - Leukemia 
dataset 37|. Sparsity is arbitrarily set to s = 50 nonzero entries per component. Fig. 2(a) de¬ 


picts the cum. variance versus the number of components, for k = 5. Deflation-based approaches 
are greedy; first components capture high variance, but subsequent ones contribute less. Our algo¬ 
rithm jointly optimizes the k = 5 components and achieves higher total cum. variance. Fig. |2(b)| 
depicts the total cum. variance achieved for various values of k. 


Additional Datasets. We repeat the experiment on multiple datasets, arbitrarily selected 
from 37|. Table [T] lists the total cumulative variance captured by A: = 5 components, each with 
s = 40 nonzero entries, extracted using the four methods. Our algorithm achieves the highest values 
in most cases. 

Bag of Words (BoW) Dataset. 37]] This is a collection of text corpora stored under the 
“bag-of-words” model. For each text corpus, a vocabulary of d words is extracted upon tokenization, 
and the removal of stopwords and words appearing fewer than ten times in total. Each document 
is then represented as a vector in that d-dimensional space, with the ith entry corresponding to the 
number of appearances of the 7th vocabulary entry in the document. 

We solve the sparse PC A problem ([2|) on the word-by-word cooccurrence matrix, and extract 
k = 8 sparse components, each with cardinality s = 10. We note that the latter is not explicitly 
constructed; our algorithm can operate directly on the input word-by-document matrix. Table [2] 
lists the variance captured by each method; our algorithm consistently outperforms the other 




TPower 

EM sPCA 

SpanSPCA 

SPCABiPart 

Amzn Com Rev 

(1500x10000) 

7.31e + 03 

7.32e + 03 

7.31e + 03 

7.79e + 03 

Aroence Train 

(100x10000) 

1.08e + 07 

1.02e + 07 

1.08e + 07 

l.lOe + 07 

CBCL Face Train 

(2429x361) 

5.06e + 00 

5.18e + 00 

5.23e + 00 

5.29e + 00 

Isolet-5 

(1559x617) 

3.31e + 01 

3.43e + 01 

3.34e + 01 

3.51e + 01 

Leukemia 

(72x12582) 

5.00e + 09 

5.03e + 09 

4.84e + 09 

5.37e + 09 

Pems Train 

(267x138672) 

3.94e + 00 

3.58e + 00 

3.896 -|- 00 

3.75e + 00 

Mfeat Pix 

(2000x240) 

5.00e + 02 

5.27e + 02 

5.08e + 02 

5.47e + 02 


Table 1: Total cumulative variance captured by A: = 5 40-sparse extracted components on various 
datasets [37(1 . For each dataset, we list the size (^samplesx ^variables) and the value of variance 


captured by each method. Our algorithm operates on a rank-4 sketch in all cases. 
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TPower 

EM sPCA 

SpanSPCA 

SPCABiPart 

BoW:NIPS 

(1500x12419) 

2.51e-|-03 

2.57e-|-03 

2.53e-|-03 

3.34e -1- 03 (+29.98%) 

BoW:KOS 

(3430x6906) 

4.14e-|-01 

4.24e-|-01 

4.21e-|-01 

6.14e + 01 (+44.57%) 

BoW:Enron 

(39861x28102) 

2.11e-|-02 

2.00e -1- 02 

2.09e-|-02 

2.38e + 02 (+12.90%) 

BoWiNyTimes 

(300000x102660) 

4.81e-|-01 

- 

4.81e-|-01 

5.31e + 01 (+10.38%) 


Table 2: Total variance captured by A; = 8 extracted components, each with s = 15 nonzero entries 


- Bag of Words dataset [37[. For each corpus, we list the size (^documents x ^vocabulary-size) 


and the explained variance. Our algorithm operates on a rank-5 sketch in all cases. 


approaches. 

Finally, note that here each sparse component effectively selects a small set of words. In turn, 
the k extracted components can be interpreted as a set of well-separated topics. In Table El we list 
the topics extracted from the NY Times corpus (part of the Bag of Words dataset). The corpus 
consists of 3 • 10^ news articles and a vocabulary of d = 102660 words. 



Topic 1 

Topic 2 

Topic 3 


Topic 4 

Topic 5 

Topic 6 

Topic 7 

Topic 8 

i 

percent 

zzz_united_states 

zzz.bush 


company 

team 

cup 

school 

zzz_al_gore 

2 

million 

zzz_u_s 

official 


companies 

game 

minutes 

student 

zzz_george_bush 

3 

money 

zzz.american 

government 

market 

season 

add 

children 

campaign 

4 

high 

attack 

president 

stock 

player 

tablespoon 

women 

election 

5 

program 

military 

group 


business 

play 

oil 

show 

plan 

6 

number 

Palestinian 

leader 


billion 

point 

teaspoon 

book 

tax 

7 

need 

war 

country 


analyst 

run 

water 

family 

public 

8 

part 

administration 

political 


firm 

right 

pepper 

look 

zzz.washington 

9 

problem 

zzz_white_house 

american 

sales 

home 

large 

hour 

member 

10 

com 

games 

law 



cost 

won 

food 

small 

nation 

Table 3: 

BoW:NyTimes dataset 

37 

|. The table lists the words corresponding to the 


nonzero entries of each of the k = 8 extracted components (topics). Words corresponding to higher 
magnitude entries appear higher in the topic. 


6 Conclusions 

We considered the sparse PCA problem for multiple components with disjoint supports. Existing 
methods for the single component problem can be used along with an appropriate deflation step to 
compute multiple components one by one, leading to potentially suboptimal results. We presented 
a novel algorithm for jointly optimizing multiple sparse and disjoint components with provable 
approximation guarantees. Our algorithm is combinatorial and exploits interesting connections 
between the sparse PCA and the bipartite maximum weight matching problems. It runs in time 
that grows as a low-order polynomial in the ambient dimension of the input data, but depends 
exponentially on its rank. To alleviate this dependency, we can apply the algorithm on a low¬ 
dimensional sketch of the input, at the cost of an additional error in our theoretical approximation 
guarantees. Empirical evaluation of our algorithm demonstrated that in many cases it outperforms 
deflation-based approaches. 
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Supplemental Material 

7 On the sub-optimality of deflation — An example 

We provide a simple example demonstrating the sub-optimality of deflation based approaches for 
computing multiple sparse components with disjoint supports. Gonsider the real 4x4 matrix 

‘l 0 0 e‘ 

, 0 J 0 0 

A = 

0 0 J 0 ’ 
e 0 0 1 

with e,d > 0 such that e -|- J < 1. Note that A is PSD; A = B^B for 

‘l 0 0 e ‘ 

g _ 0 \/j 0 0 

“ 0 0 ^/J 0 

0 0 0 Vl - e2 

We seek two 2-sparse components with disjoint supports, i.e., the solution to 

2 

max xjAxi, (8) 

i=i 


where 


X= {X G : ||xi ||2 < 1, ||xj||o < 2 Vi G {1, 2},supp(xi) n supp(x 2 ) = 0} . 
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Iterative computation with deflation. Following an iterative, greedy procednre with a 
deflation step, we compute one component at the time. The first component is 

xi = argmax x^Ax. (9) 

||x||o=2,||x||2=1 

Recall that for any unit norm vector x with support I = supp(x), 

x"^Ax < Amax (A/,/), (10) 

where Ajj denotes the principal submatrix of A formed by the rows and columns indexed by I. 
Equality can be achieved in for x equal to the leading eigenvector of Ajj. Hence, it suffices to 
determine the optimal support for xi. Due to the small size of the example, it is easy to determine 
that the set Ii = {1,4} maximizes the objective in (llOp over all sets of two indices, achieving value 


x7 Axi = Aj. 


1 e 
e 1 


= 1 + e. 


( 11 ) 


Since subsequent components must have disjoint supports, it follows that the support of the second 
2-sparse component X 2 is I 2 = {2,3}, and X 2 achieves value 


x{ Ax 2 = An 


<5 0 

0 <5 


< 5 . 


( 12 ) 


In total, the objective value in ([8]) achieved by the greedy computation with a deflation step is 


2 


J^xjAx, 

i=i 


= 1 + e + 5. 


(13) 


The sub-optimality of deflation. Consider an alternative pair of 2-sparse components x} and X 2 
with support sets /( = {1, 2} and = {3,4}, respectively. Based on the above, such a pair achieves 
objective value in ([8j) equal to 


A 


max 


1 0 

0 (5 


+ An 


<5 0 

0 1 


1 + 1 = 2 , 


which clearly outperforms the objective value in (|13ll (under the assumption e + < 1), demon¬ 

strating the sub-optimality of the xi, X 2 pair computed by the deflation-based approach. In fact, 
for small e, 6 the objective value in the second case is larger than the former by almost a factor of 
two. 


8 Construction of Bipartite Graph 

The following algorithm formally outlines the steps for generating the bipartite graph G = V, 

given a weight d x k matrix W. 
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Algorithm 4 Generate Bipartite Graph 

input Real d x k matrix W 


output Bipartite G 

II 

S 

jl 

{Fig. m 

1 

for 7 = 1,..., A; do 


2 

Uj^{u^^\... 

,V'} 


3 

end for 



4 

U £- 


{\U\=k-s} 

5 

V^{l,...,d} 



6 

E^U xV 



7 

for i = 1,... ,d do 


8 

for j = 1 ,..., 

k do 


9 

for each u 

G Uj do 


10 

w{u,Vi) G 

- W^- 

V 


11 

end for 


12 

end for 



13 

end for 




9 Proofs 

9.1 Guarantees of Algorithm [2] 

Lemma 12.11 For any real d x k matrix W, and Algorithmic outputs 

k 

X = arg max , W-^ )' (14) 

in time 0[d ■ [s ■ A:)^). 

Proof. Consider a matrix X G and let Ij, j = 1... ,k denote the support sets of its columns. 
By the constraints in Xk, those sets are disjoint, i.e., Ij^ n Ij^ = 0 Vji, J 2 G {1, ■ ■ ■, k},ji / j 2 , and 

= (E ^ E(E »'«) ■ (15) 

j=i j=i ieij j=i ie/j 

The last inequality is due to Cauchy-Schwarz and the fact that ||Xl II 2 < 1, Vj G {l,...,k}. In 
fact, if the supports sets Ij, j = 1,..., k were known, the upper bound in (fTSll would be achieved 
by setting Xj. = Wj,/||W}, || 2 , i.e., setting the nonzero subvector of the jth column of X colinear 
to the corresponding subvector of the jth column of W. Hence, the key step towards computing 
the optimal solution X is to determine the support sets Ij, j = 1,... ,k of its columns. 

Consider the set of binary matrices 

Z= |z G {0,: llZ^llo < s Vj G [fe],supp(Z*) C supp(Z^) = 0 Vi, j G [k],i 7 ^ j| . 

The set represents all possible supports for the members of Taking into account the previous 
discussion, the maximization in (1141) can be written with respect to Z £ Z: 

k k d 
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Let Z G denote the optimal solution, which corresponds to the (support) indicator of X. Next, 
we show that computing Z boils down to solving a maximum weight matching problem on the 
bipartite graph generated by Algorithm 01 Recall that given W G Algorithm 0] generates a 

complete weighted bipartite graph G = {U, V, E) where 

• R is a set of d vertices vi,... ,Vd, corresponding to the d variables, i.e., the d rows of X. 

• [/ is a set of fe • s vertices, conceptually partitioned into k disjoint subsets Ui,... ,Uk, each of 
cardinality s. The jth subset, Uj, is associated with the support the s vertices Ua\ a = 1,..., s 
in Uj serve as placeholders for the variables/indices in Xj. 

• Finally, the edge set is E = U x V. The edge weights are determined by the d x k matrix W 
in dH). In particular, the weight of edge {ua\vi) is equal to IF?-. Note that all vertices in Uj are 
effectively identical; they all share a common neighborhood and edge weights. 

It is straightforward to verify that any Z G corresponds to a perfect matching in G and vice 
versa; Zij = 1 if and only if vertex Uj G F is matched with a vertex in Uj (all vertices in Uj are 
equivalent with respect to their neighborhood). Further, the objective value in (|16p for a given 
Z G is equal to the weight of the corresponding matching in G. More formally, 

• Given a perfect matching Ad, the support Ij of the jih. column of Z is determined by the 
neighborhood of Uj in the matching: 

Ij[i e[d]\ {u,Vi) e M,ueUj], j = l,...,k. (17) 

Note that the sets Ij, j = 1,... ,k are indeed disjoint, and each has cardinality equal to s. The 
weight of the matching Ai is 

k /c k d 

w{u,v) = Y ( 18 ) 

{u,v)eM j = l {u,Vi)GM: j = l i&Ij j = l *=1 

u£Uj 

which is equal to the objective function in (I16h . 

• Conversely, given an indicator matrix Z G Xi, let /j=supp(Z-^), and let Ij{a) denote the ath 
element in the set, a = 1,..., s (with an arbitrary ordering). Then, 

M = = 1, - • • ,s, j = 1, ■ ■ ■, A:| CE 

is a perfect matching in G. The objective value achieved by Z is equal to the weight of A4: 

j=l i=l j=l j=l 0=1 (u,v)eM 

It follows from (1181) and (|19p that to determine Z, it suffices to compute a maximum weight perfect 
matching in G. The desired support is then obtained as described in (I17p (lines 4-7 of Algorithmic]). 
This complete the proof of correctness of Algorithm [2| which proceeds in the steps described above 
to determine the support of X. 

The weighted bipartite graph G is generated in 0{d- {s ■ k)). The running time of Algorithm [2] is 
dominated by computing the maximum weight matching of G. For the case of unbalanced bipartite 
graph with \U\ = s ■ k < d = |F| the Hungarian algorithm can be modified to compute 
the maximum weight bipartite matching in time ©(iFlIlt/l + |t/plog|t/|) = 0{d- (s • A:)^). This 
completes the proof. □ 
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9.2 Guarantees of Algorithm [T] — Proof of Theorem [T] 

We first prove a more general version of Theorem [1] for arbitrary constraint sets. Combining that 
with the guarantees of Algorithm [ 2 l we prove the Theorem [TJ 

Lemma 9.2. For any real d x d rank-r PSD matrix A and arbitrary set A C let X*= 

argmaxxg;r Tr(X^AX). Assuming that there exists an operator Px ■ —?■ A such that Px{^)— 

argmaxxg;t'(xj, then Algorithm[I\ outputs X € A sueh that 

Tr(X^AX) > (1 - e) • Tr(x7AX*), 

in time Tsvd{'<') + 0((|)^ ^ • {Tx + fcd)), where Tx is the time required to compute Px{-) o-nd TsvD{r) 
the time required to compute the truncated SVD of A. 

Proof. Let A = UAU^ denote the truncated eigenvalue decomposition of A; A is a diagonal r x r 
whose ith diagonal entry An is equal to the ith largest eigenvalue of A, while the columns of U 
contain the corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x E 

x’^Ax = ||A^'^^u"^x ||2 > (A^^'^U^x, c)^ Vc € M'’: ||c ||2 = 1. (20) 

_]^/2_ 

In fact, equality in (1201) is achieved for c colinear to A Ux, and hence, 

x^Ax = max (A^/"u^x, c>^ ( 21 ) 

cGSr^ 


In turn. 


Tr(^X^Ax) 


k 

VX^'^AX^' 


j = l 


k 

max y/A^^^U^X^', C^V. 


( 22 ) 


Recall that X* is the optimal solution of the trace maximization on A, i.e., 

X*= argmaxTR("x^Ax') . 
xex ^ ' 

Let C* be the maximizing value of C in (I22p for X = X*, he., C* is an r x fc matrix with unit-norm 
columns such that for all j E {1,... , A:}, 

Xi^AXi = (A'/y^Xi, (23) 

Algorithm [1] iterates over the points (r x k matrices) C in A/"®^ cartesian power of 

a finite '^/ 2 -net of the r-dimensional / 2 -URit sphere. At each such point C, it computes a candidate 

k 

X = argmaxy(X^UAi/2cjy 

XGA^ i=i 

via Algorithm [2] (See Lemma 19.11 for the guarantees of Algorithmic]). By construction, the set 
A/"®^ (^ 2 ~^) contains a Cjj such that 

||C(t - C*||oo ,2 = max IIC-J - Ci ||2 < e/ 2 . (24) 
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Based on the above, for all j E {1,..., A;}, 




1/2 


< 


< 


< 


(A'/"u^Xi, Ci)| 

(A'/"u^xi, q> + (A'/'u^xi, {Ci - q))| 
(A'/"u^xi, q)| + KA'/'u^xi, (Ci - q))| 


(A^/^U^Xi, C^l + ||A^^"U' Xill • ||Ci - O 


A V.7 


(A'/^U^Xi, Cn| + (6/2) • (Xi ' AXi) 


vri^ A vrit 


(25) 


The first step follows by the definition of C*, the second by the linearity of the inner prodnct, 
the third by the triangle inequality, the fourth by Cauchy-Schwarz inequality and the last by 
Rearranging the terms in 


|(A'/'u^Xi, Cj)| > (1 - f) • (Xi^AXi)'/' > 0, 

and in turn, 

(A'/'u^Xi, Cj)' > (1 - f)' • Xi^'AXi > (1 - 6) • Xi^AXi 
Summing the terms in (I26|) over all j E {1,..., fc}, 

^(A'/'u^xi, q>' > (1 - 6) • tr(x:ax,) . 

i=i 

Let Xjj E A be the candidate solution produced by the algorithm at C(j, he., 

k 

x&x 


(26) 


(27) 


Xjj= argmax^^(xj, UA qV 

i=i 


(28) 


Then, 


Tr( X«' AXtt ) 


max 

C:C7esrMp'i 


c^>= 


> cj>= 

i=i 

> E(xi. CA^'^cj/ 

1=1 
iS) 


> (l-e)-TR X/AX 


(29) 

where (a) follows from the observation in (j22p . (/?) from the sub-optimality of Cjj, ( 7 ) by the 
definition of Xjj in ([281) . while (<5) follows from (l271l . According to (f29]l . at least one of the candidate 
solutions produced by Algorithm [H namely Xp, achieves an objective value within a multiplicative 
factor (1 — e) from the optimal, implying the guarantees of the lemma. 

Finally, the running time of Algorithm [1] follows immediately from the cost per iteration and 
the cardinality of the '^/ 2 -net on the unit-sphere. Note that matrix multiplications can exploit the 
singular value decomposition which is performed once. □ 
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Theorem [H For any real dx d rank-r PSD matrix A, desired number of components k, number s 
of nonzero entries per component, and accuracy parameter e G (0,1), AlgorithmUl outputs X G 
such that 


Tr(X^AX) > (1 - e) • Tr(xJAX*), 

where X*= argmaxxg;^^ Tr(X^AX) , in time Tsvair) + 0[{^y ^ ■ d-{s ■ kY). Tsvair) is the time 
required to compute the truncated SVD of A. 

Proof. Recall that is the set of d x A: matrices X whose columns have unit length and pairwise 
disjoint supports. Algorithm [2l given any W G computes X G that optimally solves the 

constrained maximization in line 5. (See Lemma l9.ll for the guarantee of Algorithm [ 2 ]) . in time 
0[d- {s ■ ky'j. The desired result then follows by Lemma 19.21 for the constrained set A^. □ 

9.3 Guarantees of Algorithmic] — Proof of Theorem [2] 

We prove Theorem [2] with the approximation guarantees of Algorithm [3l 
Lemma 9.3. For any dx d PSD matrices A and A, and any set A C let 

X*= argmaxTR^X^Ax') , and X*=argmaxTR(X^AX). 
xe.v '' xeA" 

Then, for any X G A such that Tr(x'''aX) > 7 • Tr(XJAX*) for some 0 < 7 < 1, 

Tr(X^AX) > 7 • Tr(xJAX*) - 2 • ||A - A ||2 • max ||X|||. 

Proof. By the optimality of X* for A, 

tr(^xJax*) > tr(^xJax*) . 

In turn, for any X G A such that Tr^X^AX^ > 7 • Tr(xJAX*) for some 0 < 7 < 1, 

tr(^x^ax) > 7 • Tr(^XJAX*) . 

Let E=A — A. By the linearity of the trace, 

Tr(^X^Ax) = Tr(^X^Ax) - Tr(^X^Ex) 

< tr(^x^ax) + |tr(^x^ex) |. 

By Lemma flO. 101 

|Tr(x"^Ex) I < ||X||f • ||X||f • ||E ||2 < ||E ||2 • max ||X||| = R. 

Continuing from (1311) . 

Tr(^X^Ax) < Tr(^X^Ax) PR. 


(30) 

(31) 

(32) 

(33) 


19 




Similarly, 


tr(^xJax*) = tr(^xJax*) - tr(^xJex*) 

> Tr(^XJAX*) - |tr(^xJex*) I 

> Tr(^xJAX*) -i?. (34) 

Combining the above, we have 

Tr(^X^Ax) > Tr(^X^Ax) - R 

> 7-Tr(^XJAX*) -R 

> 7 • (^Tr(^xJAX*) -R^-R 
= 7 -Tr(x;AX*) -(l + 7 )-i? 

> 7-Tr(^xJAX*) - 2-R, 

where the first inequality follows from (|33l) the second from (I30p . the third from (|34p . and the last 
from the fact that i? > 0 and 0 < 7 < 1. This concludes the proof. □ 

Remark 9.2. If in Lemma [g.3l the PSD matrices A and A G are such that A — A is also 

PSD, then the following tighter bound holds: 

k 

Tr(X^AX) > 7 • Tr(xJax*) - ^ Ai (a - a) . 

i=l 

Proof. This follows from the fact that if E=A — A is PSD, then 

d 

Tr(^X^Ex) = ^xJExj > 0, 
i=i 

and the bound in (IHip can be improved to 

Tr(^X^Ax) = Tr(^X^Ax) - Tr(^X^Ex) < Tr(^X^Ax) . 

Further, by Lemma llO.lll the bound in (I32p can be improved to 

k 

Tr(X^EX) < ^ Ai(E) = R. 

i=l 

The rest of the proof follows as is. □ 

Theorem [ 2 I For any nx d input data matrix S, with corresponding empirical covariance matrix 
A = i/n • S'''S, any desired number of components k, and accuracy parameters e € (0,1) and r, 
Algorithmic outputs X(,,) G such that 

Tr(x;^)AX„) > (l-e)-TR(XjAX*)-2-A:-IIA-AII 2 , 

where X*= argmaxxg;^* Tr(X'^AX), in time TsKETCH{r) + Tswir) + ■ d ■ {s ■ k^). 

Proof. The theorem follows from Lemma f9.3l and the approximation guarantees of Algorithm [TJ □ 
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9.4 Proof of Theorem [3] 


First, we restate and prove the following Lemma by 



Lemma 9.4. Let A € be an positive semidefinite matrix with entries in [—1,1], and'V G 
matrix such that A = VV^. Consider a random matrix R G with entries drawn according to 
a Gaussian distribution N{0,l/r), and define 


A = VRR^V^. 


Then, for r = 0{e ^ log d), 

|[A]ij- — [A]jj| < 

for all i,j with probability at least 1 — 1/d. 


Proof. The proof relies on the Johnson-Lindenstrauss (JL) Lemma 381], according to which for any 
two unit norm vectors x, y G and R generated as described 


Pr{|x^RR'y - x^yj > e} < 2 


,-(e2-e3)-r/4 


Observe that each element of A is in [—1,1], hence can be rewritten as an inner product of two 
unit-norm vectors: 


[A]*,, = 


Setting r = 0{e ^logd) and using the JL lemma and a union bound over all 0{d?) vector pairs 
V:,i, V,, j we obtain the desired result. □ 


Next, we provide the proof of Theorem[3]for the simple case of A: = 1; the proof easily generalizes 
to the multi-component case A: > 1. According to Lemma 19.41 choosing d = O ((J/6)“^ logn) = 
O (log n) suffices for all entries of A constructed as described in the lemma to satisfiy 


|[A]jj [A]jj| < g 

with probability at least 1 — 1/d. In turn, for any s-sparse, unit-norm x. 




e 

n n 

|x^Ax — x^Ax| = 

XiXj ( [A] [A] ij ) 

0 

< - ■ 
- 6 




2=1 j = l 


< ^ • llxjl? < ^ • (Vs • ||x||2)^ ^ ^ 

where the second inequality follows from the fact that x is s-sparse and unit norm. 

We run Algorithm[T](for A: = 1) with input argument the rank-r matrix A, desired sparsity s and 
accuracy parameter e = 6/6. Algorithm [1] outputs a s-sparse unit-norm vector x which according 
to Theorem [T] satisfies 


(1 - s/e) . Xd"^Axrf < x'^Ax < Xrf’^Axd, (36) 

where x^ is the true s-sparse principal component of A. This, in turn, implies that x satisfies 


x^ Ax - Xrf"^ Axd 


< - • XiJ Axd <-(l + -)s<--s 


6 


6 


6 


(37) 
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where the second inequality follows from the fact that the entries of A lie in [—1 — 1 + |] and x 

is s-sparse and unit-norm. 

In the following, we bound the difference of the performance of x on the original matrix A from 
the optimal value. Let x* denote the s-sparse principal component of A and define 

OPT=x/Ax,,. 

Then, 

|OPT — x'^'AxI = |xJAx* — x'''Ax| 

= |xjAx* - xjAxrf Xd^Axd - x"^Ax| 

< |xJAx* - Xrf’^Axrfl |xrf"^Axd - x'^AxI . (38) 

'-V-' '-V-" 

A B 

Utilizing (|35l) and the triangle inequality, one can verify that 

A = |x*'^Ax* - Xd^Axd Xd^Axd - Xd^Axd\ 

< |x*’^Ax* - Xrf’^Axdl Ix^’^Axd - Xd’^Axrfl 

T . T * ^ 

< X* Ax* - Xd Axd + ■ s 

'-V-' 6 

>0 

< x*'^Ax* - Xd^Axd + 77 • s + Xd^Axd - x*'^Ax* 

6 '-^' 

>0 

< x*'^Ax* - x*'^Ax* 77 • s -h Xd^Axd - Xd^ Axd 

b 

< |x*'^Ax* - x*'^Ax*| 77 • s -h |xd"^AX(i - Xd'''AX(i| 

D 

< ^ • s. (39) 

Similarly, 


B = 


< 


Xd^ Axd — X ' Ax -|- X ' Ax — X ' Ax 


7TX 


TXi 


;T 


Xd~^ Axd - x"^ Ax 


Xd^Axd - x"^Ax 


+ 


X Ax — X Ax 


<5 

+ 6-" 


(“) 25 5 

< — • s -I— • s 
“6 6 

5 

~2^' 

where (a) follows from (j37l] . Continuing from ([38]), combining (I39p and ()4n|] we obtain 

|0PT —x^Ax < 5 ■ s, 


(40) 


which is the desired result. 
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10 Auxiliary Technical Lemmata 

Lemma 10.5. For any real d x n matrix M, and any r,k < min{(i, n} 


r+fc 

E ^ 

2=r+l 


k 


y/r + k 

where (Ti(M) is the ith largest singular value o/M. 
Proof. By the Cauchy-Schwartz inequality, 


• IIMI 


r-\-k 


r-\-k 


r-\-k 


1/2 


r-\-k 


1/2 


E ^*(M) = E < E 


i=r-\-l 


2=r+l 


\i=r-\-l 


\ik\\2 = Vk- E 

\i=r+l / 


Note that (Tr+i(M),... , (Tr+fc(M) are the k smallest among the r + k largest singular values. Hence, 


r+fc 


k 


r-\-k 


k 


min{d,n} 




k 


r + k 


r + k 




i=r-\-l 2=1 2=1 

Combining the two inequalities, the desired result follows. 

Corollary 1. For any real dxn matrix M and k < min{ (i,n}, (Tfc(M) < k~^l’^ ■ ||M||f. 
Proof. It follows immediately from Lemma 110.51 


□ 


□ 


Lemma 10.6. Let ai,..., and bi,... ,bn be 2n real numbers and let p and q be two numbers 
such that 1/p+ !/(/ = 1 and p > 1. We have 

\ i/p / n 

1 &•'’ 

Lemma 10.7. For any two real matrices A and B of appropriate dimensions, 


IE' 

2 = 1 




||AB|| F < min{|| A||2 ||B||f, ||A||f||B||2} . 

Proof. Let bj denote the zth column of B. Then, 

||AB||| = El|Abi||i < Ell^llall^illi = IIA||i E = IAIlil|B|||. 

i i i 

Similarly, using the previous inequality, 

||AB||| = llB^A^Ill < ||B^||i||A^||| = ||B||i||A|||. 

Combining the two upper bounds, the desired result follows. □ 

Lemma 10.8. For any A,B e 

|(A,B)|^|Tr(^A^b)| < ||A||f||B||f. 

Proof. The inequality follows from Lemma 110.61 for p = q = 2, treating A and B as vectors. □ 
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Lemma 10.9. For any real m x n matrix A, and any k < min{m, n} 


max ||AY||f 

YgRnxfc 

YTY=Ifc 



The maximum is attained by Y coinciding with the k leading right singular vectors of A. 

Proof. Let be the singular value decomposition of A; U and V are m x m and n x n 

unitary matrices respectively, while E is a diagonal matrix with Tijj = Oj, the jih. largest singular 
value of A, j = 1,..., d, where d= min{m, n}. Due to the invariance of the Frobenius norm under 
unitary multiplication, 

||AY||| = IIUEV’^YIII = ||SV^Y|||. (41) 


Continuing from (|4ip . 


EV’^YllI = Tr( Y ' VE^V ‘ Y 1 = 


T ■\r-^2\rT 


K a 

1=1 j=i 


4 ■ ''jE iy‘ = 


Ed'Eb, 

j=i 1=1 


Jy^ 


Let Zj=Yf!i=i j = 1,... ,d. Note that each individual Zj satisfies 

k 

0 < Zj='^ (vjYi) <||vjf = l, 


2 = 1 


where the last inequality follows from the fact that the columns of Y are orthonormal. Further, 


j=i j=i1=1 


d k ry k d ry 

^ / T, 


Jyii = 


EE 

1=1 j=l 


v,yi = 




2=1 


Combining the above, we conclude that 


I AY||| — <jf ■ Zj < af + ... + 
i=i 


(42) 


Finally, it is straightforward to verify that if = Vj, f = 1,..., A;, then (14211 holds with equality. □ 

Lemma 10.10. For any real dxn matrix A, and pair of dxk matrix X and nxk matrix Y such 
that X'^^X = Ifc and Y'''Y = 1^ with k < min{d, n}, the following holds: 


|TR(x■^AY)|<^/fc•(^u2(A)) 


1/2 


2=1 


Proof. By Lemma 110.81 

|(X, AY)| = ITr^X'^Ay')! < ||X||f • ||AY||f = Vk ■ ||AY||f. 


where the last inequality follows from the fact that ||X||p = Tr(X^X) = TR(Ifc) = k. Combining 
with a bound on ||AY||f as in Lemma 110.91 completes the proof. □ 
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Lemma 10.11. For any real d x d PSD matrix A, and k x d matrix X with k < d orthonormal 
columns, 


k 

Tr(^X^Ax) < ^Ai(A) 

^=1 

where Aj(A) is the ith largest eigenvalue of A. Equality is achieved for X coinciding with the k 
leading eigenvectors of A. 

Proof Let A = VV^ be a factorization of the PSD matrix A. Then, Tr(X'^ AX) = Tr(X’^VV'T'X) 
||V~''X|||. The desired result follows by Lemma 110.91 and the fact that Ai(A) = affV), i = 

1,... ,d. □ 

11 Additional Experimental Results 


k = 6 components, s = 10 nnz/component 


s = 10 nnz/component 




3 4 5 6 7 

Number of target components 

(b) 

Figure 3: Cumulative variance captured W k s-spars components computed on the word-by-word 
matrix - BagOfWords:NIPS dataset 371]. Sparsity is arbitrarily set to s = 10 nonzero entries 
per component. Fig. 3(a) depicts the cum. variance captured by A: = 6 components. Deflation leads 
to a greedy formation of components; first components capture high variance, but subsequent ones 
contribute less. On the contrary, our algorithm jointly optimizes the k components and achieves 
higher total cum. variance. Fig. |3(b)| depicts the total cum. variance achieved for various values 
of k. Our algorithm operates on a rank-4 approximation of the input. 
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SPCABiPart SpanSPCA TPower 



Topic 1 

Topic 2 

Topic 3 

Topic 4 

Topic 5 

Topic 6 

Topic 7 

Topic 8 

i 

network 

algorithm 

neuron 

parameter 

object 

classifier 

word 

noise 

2 

model 

data 

cell 

point 

image 

net 

speech 

control 

3 

learning 

system 

pattern 

distribution 

recognition 

classification 

level 

dynamic 

g 4 

input 

error 

layer 

hidden 

images 

class 

context 

step 

1 5 

function 

weight 

information 

space 

task 

test 

hmm 
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^ 6 

neural 

problem 

signal 

gaussian 

features 

order 
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7 

unit 

result 

visual 

linear 

feature 

examples 
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component 

8 

set 

number 

field 

probability 

representation 

rate 
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equation 

9 

training 
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synaptic 

mean 

performance 

values 

approach 

single 

10 

output 

vector 

firing 

case 

view 

experiment 

trained 

analysis 

11 

network 

algorithm 

neuron 

parameter 

recognition 

control 

classifier 

noise 

12 

model 

data 

cell 

distribution 

object 

action 

classification 

order 

0l4 

input 

weight 

pattern 

point 

image 

dynamic 

class 

term 

learning 

error 

layer 

linear 

word 

step 

net 

component 

§115 

neural 

problem 

signal 

probability 

performance 

optimal 

test 

rate 

^ 16 

function 

output 

information 

space 

task 

policy 

speech 

equation 

m 17 

unit 

result 

visual 

gaussian 

features 

states 

examples 

single 

18 

set 

number 

synaptic 

hidden 

representation 

reinforcement 

approach 

analysis 

19 

system 

method 

field 

case 

feature 

values 

experiment 

large 

20 

training 

vector 

response 

mean 

images 

controller 

trained 

form 

21 

data 

function 

neuron 

unit 

learning 

network 

model 

training 

22 

distribution 

algorithm 

cell 

weight 

space 

input 

parameter 

hidden 

h23 

gaussian 

set 

visual 

layer 

action 

neural 

information 

performance 

< 24 

25 

26 

probability 

error 

direction 

net 

order 

system 

control 

recognition 

component 

problem 

firing 

task 

step 

output 

dynamic 

classifier 

approach 

result 

synaptic 

connection 

linear 

pattern 

mean 

test 

analysis 

number 

response 

activation 

case 

signal 

noise 

word 

”28 

mixture 

method 

spike 

architecture 

values 

processing 

field 

speech 

29 

likelihood 

vector 

activity 

generalization 

term 

image 

local 

classification 

30 

experiment 

point 

motion 

threshold 

optimal 

object 

equation 

trained 


Total Cum. Variance 

TPower 

2.5999 ■ 10^ 

SpanSPCA 

2.5981 ■ 10® 

SPCABiPart 

3.2090 ■ 10 ® 


Table 4: BagOfWords:NIPS dataset [^. We run various SPCA algorithms for A; = 8 com¬ 
ponents (topics) and s = 10 nonzero entries per component. The table lists the words selected by 
each component (words corresponding to higher magnitude entries appear higher in the topic). Our 
algorithm was configured to use a rank-4 approximation of the input data. 
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SPCABiPart SpanSPCA TPower 
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Topic 3 

Topic 4 

Topic 5 

Topic 6 

Topic 7 

Topic 8 

i 
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zzz.bush 

team 
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2 
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7 
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8 
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executives 
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Israelis 

9 
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10 
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zzz_white_house 

win 

teacher 

found 
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11 
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zzz.bush 
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won 

12 

company 

game 
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student 
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night 
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million 

season 

president 

zzz_united_states 

children 

add 

part 

left 
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companies 

player 

zzz_george_bush 

zzz_u_s 

program 

tablespoon 

look 

big 

(n 15 

market 

play 

campaign 

military 

home 

teaspoon 

need 

put 

§16 

stock 

games 

official 

leader 

family 

oil 

book 

win 

cnl7 

business 

point 

government 

zzz Jsrael 

women 

pepper 

called 

hit 

18 

money 

run 

political 

zzz_american 

public 

water 

hour 

job 

19 

billion 

right 

election 

war 

high 

large 

american 

ago 

20 

plan 

coach 

group 

country 

law 

sugar 

help 

zzz_new_york 

21 

percent 

zzz_united_states 

zzz.bush 

company 

team 

cup 

school 

zzz_al_gore 

22 

million 

zzz_u_s 

official 

companies 

game 

minutes 

student 

zzz_george_bush 

S23 

money 

zzz_american 

government 

market 

season 

add 

children 

campaign 

1^24 

high 

attack 

president 

stock 

player 

tablespoon 

women 

election 

m25 

program 

military 

group 

business 

play 

oil 

show 

plan 

o26 

number 

Palestinian 
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billion 

point 

teaspoon 

book 

tax 

.^27 

need 

war 
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run 
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family 

public 

28 
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pepper 
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zzz.washington 

29 

problem 

zzz_white_house 
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home 
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30 

com 

games 

law 

cost 

won 

food 

small 
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Total Cum. Variance 

TPower 

45.4014 

SpanSPCA 

46.0075 

SPCABiPart 

47.7212 


Table 5: BagOfWords:NyTimes dataset [371]. We run various SPCA algorithms for fc = 8 


components (topics) and s = 10 nonzero entries per component. The table lists the words selected 
by each component (words corresponding to higher magnitude entries appear higher in the topic). 
Our algorithm was configured to use a rank-4 approximation of the input data. 
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Total Cum. Variance 

TPower 

48.140645 

SpanSPCA 

48.767864 

SPCABiPart 

51.873063 


Table 6: BagOfWords:NyTimes dataset [371]. We run various SPCA algorithms for fc = 8 


components (topics) and cardinality s = 15 per component. The table lists the words corresponding 
to each component (words corresponding to higher magnitude entries appear higher in the topic). 
Our algorithm was configured to use a rank-4 approximation of the input data. 
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night 

voter 

security 

found 

hour 

industry 

cancer 

official 

hit 

leader 

zzz Jsraeli 

friend 

pan 

room 

hospital 

need 

guy 

public 

nation 

ago 

fat 

small 

treatment 

right 

yard 

zzz_republican 

member 

question 

bowl 

car 

scientist 

part 

played 

presidential 

support 

teacher 

gram 

zzzjnternet 

according 

cost 

look 

federal 

called 

case 

food 

place 

blood 

system 

start 

zzz.washington 

forces 

number 

medium 

film 

heart 

Palestinian 

percent 

zzz_al_gore 

cup 

school 

team 

company 

official 

zzzjsrael 

million 

zzz.bush 

minutes 

right 

game 

companies 

government 

zzzjsraeli 

money 

zzz_george_bush 

add 

group 

season 

market 

president 

zzz_yasser_arafat 

billion 

campaign 

tablespoon 

show 

player 

stock 

zzz_united.states 

peace 

business 

election 

oil 

home 

play 

zzz.enron 

zzz_u.s 

war 

fund 

political 

teaspoon 

high 

games 

analyst 

attack 

terrorist 

tax 

zzz_white_house 

pepper 

program 

point 

firm 

zzz.american 

zzz.taliban 

cost 

administration 

water 

need 

run 

industry 

country 

zzz_afghanistan 

cut 

republican 

hour 

part 

coach 

investor 

law 

forces 

job 

leader 

large 

com 

win 

sales 

plan 

bin 

pay 

vote 

sugar 

american 

won 

customer 

public 

troop 

economy 

democratic 

serving 

look 

left 

price 

zzz.washington 

laden 

deal 

presidential 

butter 

help 

night 

investment 

member 

student 

big 

zzz.clinton 

chopped 

problem 

hit 

quarter 

system 

zzz.pakistan 

chief 

support 

pan 

called 

guy 

executives 

nation 

product 

executive 

zzz.congress 

fat 

zzz_new_york 

yard 

consumer 

case 

zzzjnternet 

financial 

military 

bowl 

number 

played 

technology 

federal 

profit 

start 

policy 

gram 

question 

ball 

share 

information 

earning 

record 

court 

food 

ago 

playing 

prices 

power 

shares 

manager 

security 

league 

told 

lead 

growth 

effort 


Total Cum. Variance 

TPower 

50.7686 

SpanSPCA 

52.8117 

SPCABiPart 

54.8906 


Table 7: BagOfWords:NyTimes dataset [371]. We run various SPCA algorithms for fc = 8 


components (topics) and cardinality s = 20 per component. The table lists the words corresponding 
to each component (words corresponding to higher magnitude entries appear higher in the topic). 
Our algorithm was configured to use a rank-4 approximation of the input data. 
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fc = 6 components, s = 10 nnz/component 



(a) 



Figure 4: Cumulative variance captured by k s-sparse {s = 10) extracted components on the 
word-by-word matrix - BagOfWords:NyTimes dataset [^. Fig. 4(a) depicts the cum. variance 
captured by /c = 6 components. Deflation leads to a greedy formation of components; first compo¬ 
nents capture high variance, but subsequent ones contribute less. On the contrary, our algorithm 
jointly optimizes the k components and achieves higher total cum. variance. Fig. |4(b)| depicts the 
total cum. variance achieved for various values of k. Sparsity is arbitrarily set to s = 10 nonzero 
entries per component. Our algorithm operates on a rank-4 approximation. 


fc = 6 components, s = 15 nnz/component 



s = 15 nnz/component 



(b) 


Figure 5: Same as Fig. 01 but for sparsity s = 15. 
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Cumulative Expl. Variance 


fc = 6 components, s = 20 nnz/component 



s = 20 nnz/component 


60 



2 3 4 5 6 7 8 

Number of target components 


(a) (b) 

Figure 6: Same as Fig. 01 but for sparsity s = 20. 
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